
feat: Implement Observability 2.0 wide events #41

Merged
adewale merged 29 commits into main from claude/logging-guidelines-spec-Vb5VM on Jan 16, 2026


@adewale commented on Jan 15, 2026

Summary

Implement lifecycle-based wide events per the Observability 2.0 pattern, replacing per-action KV logging with structured events emitted to Cloudflare Workers Logs.

Two event types

  • http_request — One per HTTP request with embedded errors, timing, and context
  • ws_session — One per WebSocket connection, emitted at disconnect with full session stats
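
As a rough illustration of the "one event per unit of work" idea, the sketch below builds a single `http_request` event and embeds a failure in that same event rather than logging it separately. The `outcome` and `error` field names follow the commit messages in this PR; the `event`, `method`, `path`, `status`, and `durationMs` names and the function names are assumptions, and the real definitions in observability.ts may differ.

```typescript
// Hypothetical shapes: the real schemas live in observability.ts.
interface EmbeddedError {
  type: string;
  message: string;
  slug: string;      // machine-readable classification
  expected: boolean; // anticipated (e.g. validation) vs unexpected
}

interface HttpRequestEvent {
  event: "http_request";
  method: string;
  path: string;
  status: number;
  durationMs: number;
  outcome: "ok" | "error";
  error?: EmbeddedError;
}

// Build exactly one event per request; a failure rides along in the same event.
function buildHttpRequestEvent(
  method: string,
  path: string,
  status: number,
  durationMs: number,
  error?: EmbeddedError,
): HttpRequestEvent {
  const base: HttpRequestEvent = {
    event: "http_request",
    method,
    path,
    status,
    durationMs,
    outcome: error ? "error" : "ok",
  };
  return error ? { ...base, error } : base;
}

// Emission is one structured console.log call, which Workers Logs can ingest.
function emit(event: HttpRequestEvent): void {
  console.log(JSON.stringify(event));
}

emit(buildHttpRequestEvent("GET", "/api/health", 200, 3));
```

Keeping the error inside the parent event means a single query can filter on `outcome` and still see the request's timing and context.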

Key features

  • Creator detection via IP + User-Agent hash (persists across page refreshes)
  • Sync health tracking — syncRequestCount, syncErrorCount for debugging client sync issues
  • Response size tracking for key endpoints
  • Deployment metadata — version ID, colo, country from Cloudflare bindings
  • Warning collection for recovered errors and near-misses
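
The creator detection bullet above could be sketched roughly as follows. The PR mentions a `CreatorIdentity` interface and a `hashUserAgent` helper but does not show them, so the field names, the `identify` and `isCreator` helpers, and the FNV-1a style hash are all assumptions chosen for brevity.

```typescript
// Hypothetical shape. The IP is kept server-side for comparison only and is
// not written to the emitted event.
interface CreatorIdentity {
  ip: string;
  userAgentHash: string;
}

// FNV-1a style 32-bit hash over UTF-16 code units. This is an assumption,
// not the hash used in the PR.
function hashUserAgent(userAgent: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < userAgent.length; i++) {
    hash ^= userAgent.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16);
}

function identify(ip: string, userAgent: string): CreatorIdentity {
  return { ip, userAgentHash: hashUserAgent(userAgent) };
}

// A connection counts as the creator when both the IP and the User-Agent hash
// match the identity stored at session creation.
function isCreator(
  stored: CreatorIdentity | undefined,
  current: CreatorIdentity,
): boolean {
  return (
    stored !== undefined &&
    stored.ip === current.ip &&
    stored.userAgentHash === current.userAgentHash
  );
}
```

Because the comparison does not depend on a per-connection player ID, it survives a page refresh, but it will miss a creator who switches networks or browsers.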

New tooling

  • npm run test:e2e:full-stack — Run E2E tests against wrangler dev (full Cloudflare stack)
  • /api/health endpoint for monitoring and test runners

Files changed

| Area | Files |
|------|-------|
| Core | observability.ts, route-patterns.ts (new) |
| Worker | index.ts, live-session.ts, logging.ts |
| Config | wrangler.jsonc, types.ts |
| Tests | playwright.config.ts, test-e2e-full-stack.ts (new) |
| Docs | 26 commits of research and spec work |

Test plan

  • 4022 unit tests pass
  • TypeScript compiles
  • Lint passes
  • E2E smoke tests pass against wrangler dev (16/16)
  • Deployed to staging.keyboardia.dev

🤖 Generated with Claude Code

claude and others added 29 commits January 15, 2026 15:46
Extract actionable guidelines from modern logging philosophy and analyze
delta with Keyboardia's current observability approach. Key insights:

- Wide events (emit once per lifecycle) vs narrow events (current)
- Tail sampling strategy to reduce KV writes by 90%
- Correlation IDs for end-to-end tracing
- Structured JSON console output for wrangler tail

Includes 4-phase implementation roadmap with code examples.
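
To make the tail sampling idea concrete, here is a minimal sketch of a keep-or-drop decision made after an event is complete. The `CompletedEvent` shape, the option names, and the thresholds are hypothetical; a `sampleRate` of 0.1 is used only because it would drop roughly 90% of ordinary events.

```typescript
// Hypothetical event subset needed for the sampling decision.
interface CompletedEvent {
  outcome: "ok" | "error";
  durationMs: number;
}

interface TailSamplingOptions {
  slowThresholdMs: number; // always keep events at or above this duration
  sampleRate: number;      // fraction of ordinary events to keep, 0..1
}

// Tail sampling decides after the unit of work finishes, so errors and slow
// requests can always be kept while routine traffic is sampled down.
function shouldKeep(
  event: CompletedEvent,
  options: TailSamplingOptions,
  random: () => number = Math.random,
): boolean {
  if (event.outcome === "error") return true;
  if (event.durationMs >= options.slowThresholdMs) return true;
  return random() < options.sampleRate;
}
```

The random source is injectable so the decision can be tested deterministically.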
Incorporate industry best practices from Cloudflare Workers docs and
Charity Majors / Honeycomb philosophy:

- Add Observability 1.0 vs 2.0 framework (single source of truth)
- Add Cloudflare-native config (wrangler.toml, Workers Logs, head sampling)
- Recommend Workers Logs over KV storage (7-day retention, no quota impact)
- Add derived metrics principle (compute from events, don't collect separately)
- Add query-first mindset for unknown unknowns
- Add automatic instrumentation section (invocation_logs, traces)
- Expand sampling to cover head + tail strategy
- Add SQL query examples for Workers Logs Query Builder
- Update sources with Cloudflare and Honeycomb references

Note: OpenTelemetry export intentionally excluded per requirements.

Organizational changes reflecting the generational shift:
- Rename LOGGING-GUIDELINES.md → OBSERVABILITY-2-0.md
- Move OBSERVABILITY.md → specs/archive/ (Phase 7 implementation)
- Add archive notice to legacy doc

Content additions to Observability 2.0:
- Add HTTP wide events for session lifecycle (create, access, publish, remix)
- Add retention strategy for 30-day admin metrics via KV daily rollups
- Add scheduled worker pattern for aggregating Workers Logs → KV
- Define query strategy by time range (Workers Logs vs KV)

This enables admin dashboards to answer:
- Sessions created/published/remixed over 1, 7, 30 days
- Remix rates and most-remixed sessions
- Long-term trends beyond Workers Logs 7-day retention

Changes:
- Remove section 2.6 (KV-based 30-day retention strategy)
  Accept Workers Logs 7-day limit as the retention boundary

- Add Appendix D: Complete Wide Events Catalog
  Comprehensive event definitions from codebase audit:

  Session Lifecycle (5 events):
  - session_created, session_accessed, session_updated
  - session_published, session_remixed

  WebSocket Events (3 events):
  - ws_session_end (primary wide event with full context)
  - ws_player_joined, ws_player_left

  Error Events (6 events):
  - error_rate_limit, error_quota_exceeded, error_validation
  - error_ws_connection, error_invariant, error_mutation_rejected

  Sync Events (3 events):
  - sync_hash_mismatch, sync_snapshot_sent, sync_client_behind

  Playback Events (2 events):
  - playback_started, playback_stopped

- Add volume estimates (~3,000-25,000 events/day)
- Add "What we intentionally don't track" section
- Add query recipes for common analytics questions

Total: 19 wide event types covering complete system observability

Replace vague ranges with explicit calculations based on:
- 500 DAU baseline assumption
- Per-user behavior rates (sessions created, accesses, multiplayer adoption)
- Derived event volumes with clear math

Add scaling projection table showing linear growth from 50 to 500K DAU.
Even at 500K DAU, event volume is only 0.2% of Workers Logs limit.

Convert from 1,177-line implementation spec to 155-line research document.

Changes:
- Remove implementation phases, code examples, event schemas
- Remove appendices (wide event catalog, sampling tree, query recipes)
- Keep core philosophy (Obs 1.0 vs 2.0, wide events pattern)
- Keep sources (Charity Majors, loggingsucks.com, Cloudflare)
- Keep applicability analysis and trade-offs
- Add clear recommendation: "No immediate action required"

Restore OBSERVABILITY.md as the active implementation doc.
OBSERVABILITY-2-0.md is now reference material for future decisions.

…arch

- Update event volume estimates with 1,000 DAU as primary baseline
- Add 30 DAU early launch and growth scenarios for context
- Add cost analysis section showing Workers Logs pricing
- Document that Obs 2.0 would slightly reduce costs while improving retention

Defines three wide events:
- http_request_end: Full HTTP request lifecycle
- ws_session_end: WebSocket connection lifecycle with message stats
- error: Structured error tracking

Includes TypeScript schemas, examples, implementation patterns,
migration path, and effort estimates (~13 hours total).

- Rename OBSERVABILITY-2-0-IMPL.md to OBSERVABILITY-2-0-IMPLEMENTATION.md
- Move OBSERVABILITY-2-0.md to specs/research/ (research docs folder)
- Update cross-references between documents
- Change config examples from wrangler.toml to wrangler.jsonc
- Replace userId with playerId (matches Keyboardia terminology)
- Add playerId to http_request_end example
- Fix typo: oderId → playerId in ws_session_end schema
- Add "sessions created per unique user" to queryable questions
- Add "Designing Wide Events" section with 6 principles
- Include litmus test for event width
- Update config examples from wrangler.toml to wrangler.jsonc

Key changes:
- Add isCreator boolean to ws_session_end (most users are joiners)
- Add action: "access" for session joins (vs "create")
- Add Design Decisions section with included/excluded tables
- Add Typical Traces showing creator vs joiner flows
- Add Architecture Sequence Diagram showing all layers
- Update examples to show joiner perspective (majority case)

New field enables answering:
- Are people mostly consuming published content?
- Do people spend more time on published vs editable sessions?
- Which published sessions get the most views/attention?

Updated both http_request_end and ws_session_end schemas, examples,
queryable questions, and design decision tables.

Add insights from Charity Majors and Stripe's Canonical Log Lines:
- "High cardinality is the feature, not the bug"
- "Never aggregate at write-time" principle
- The "Stuff the Blob" pattern from Stripe
- Anti-patterns to avoid (low-dimensionality, PII, grep-oriented logs)

Document how engagement signals differ between consumption modes:
- Published: passive (play/stop), solo viewing, shallow but broad
- Editable: active (toggle_step, etc.), collaboration, deep but narrow

Key additions to http_request_end:
- sourceSessionId: Track which session was remixed/published FROM
- deviceType: mobile vs desktop segmentation (derived from User-Agent)

New examples showing:
- Remix action with sourceSessionId for virality tracking
- Publish action capturing the publishing flow

New queryable questions:
- "Which sessions generated the most remixes?"
- "Are mobile users more likely to consume or create?"

Documents isCreator derivation (Option 4): correlate playerId from
action="create" with ws_session_end events for same sessionId.

- isCreator now determined by comparing CF-Connecting-IP + User-Agent hash
  with stored creatorIdentity from session creation
- More reliable than ephemeral playerId (server-generated per connection)
- Added CreatorIdentity interface and hashUserAgent helper
- Documented limitations (VPN changes, different browsers)
- Updated trace diagrams to show IP-based creator identification

- Fix async/await in code examples (handleSessionCreate, handleWebSocketConnect)
- Fix Map<string, WsContext> to Map<WebSocket, WsContext>
- Clarify IP address exclusion (used server-side for isCreator, not logged)
- Add cross-reference to implementation spec from OBSERVABILITY.md
- Align event volume estimates: ~11,500/day at 1K DAU (was ~16,000 in research)
- Align effort estimates: ~13 hours (was ~15 in research)
- Add note that HTTP middleware example is simplified
- Add cross-reference from research doc to implementation spec
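
The aligned estimate of ~11,500 events/day at 1K DAU works out to about 11.5 events per daily active user. Assuming volume scales linearly with DAU, as the earlier projection table does, the arithmetic can be sketched like this (the constant and function names are illustrative, not taken from the spec):

```typescript
// Baseline taken from the aligned estimate in this commit.
const BASELINE_DAU = 1_000;
const BASELINE_EVENTS_PER_DAY = 11_500;

// Linear projection: events scale in proportion to daily active users.
function projectDailyEvents(dau: number): number {
  return Math.round((dau * BASELINE_EVENTS_PER_DAY) / BASELINE_DAU);
}

for (const dau of [50, 1_000, 500_000]) {
  console.log(`${dau} DAU -> ${projectDailyEvents(dau)} events/day`);
}
// 50 DAU -> 575 events/day
// 1000 DAU -> 11500 events/day
// 500000 DAU -> 5750000 events/day
```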
Explains why the event is guaranteed (Cloudflare ping/pong) and timing
implications for clean vs dirty disconnects.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Phase 1: POST /api/errors endpoint
- Phase 2: sendBeacon on pagehide
- Phase 3: WebSocket client_error message type

Includes transport flow diagram, client-specific schema fields,
and error type classification table.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add warnings?: Warning[] to HttpRequestEndEvent and WsSessionEndEvent
- Define Warning type with recoveryAction discriminator
- Document warning types (KVReadRetry, SlowDO, StateRepair, etc.)
- Add collection mechanism: explicit parameter for HTTP, instance Map for WS
- Max 10 warnings per event to prevent unbounded growth

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
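
A minimal sketch of the capped warning collection described in this commit. The `Warning` fields other than `recoveryAction`, the example values, and the `addWarning` helper are assumptions; only the cap of 10 and the warning type names come from the commit message.

```typescript
// Hypothetical Warning shape; the commit only states that recoveryAction
// acts as a discriminator.
interface Warning {
  type: string;           // e.g. "KVReadRetry", "SlowDO", "StateRepair"
  message: string;
  recoveryAction: string; // hypothetical values such as "retry_succeeded"
}

const MAX_WARNINGS_PER_EVENT = 10;

// Append a warning unless the cap is reached. Returns whether it was stored,
// so callers can notice when warnings are being dropped.
function addWarning(warnings: Warning[], warning: Warning): boolean {
  if (warnings.length >= MAX_WARNINGS_PER_EVENT) return false;
  warnings.push(warning);
  return true;
}
```

For HTTP handlers the array would be passed explicitly through the request path; for WebSockets the commit describes an instance Map holding one such array per connection.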
…vents

- Add deploy object (versionId, versionTag, deployedAt) from CF_VERSION_METADATA
- Add infra object (colo, country) from request.cf
- Add service object (name, environment) for identity
- Document wrangler.jsonc configuration for version_metadata binding
- Include code examples for accessing CF metadata in handlers

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
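
The metadata objects in this commit could be assembled along these lines. The version metadata binding exposes `id`, `tag`, and `timestamp`, and `request.cf` carries `colo` and `country`; mapping `timestamp` to `deployedAt`, the `"unknown"` fallbacks, and the `buildEventMetadata` name are assumptions.

```typescript
// Shape of the version metadata binding, configured in wrangler.jsonc as
// "version_metadata": { "binding": "CF_VERSION_METADATA" }.
interface VersionMetadata {
  id: string;
  tag: string;
  timestamp: string;
}

// Subset of request.cf used for the infra object.
interface CfSubset {
  colo?: string;
  country?: string;
}

interface EventMetadata {
  deploy: { versionId: string; versionTag: string; deployedAt: string };
  infra: { colo: string; country: string };
  service: { name: string; environment: string };
}

// request.cf can be undefined (for example in some local or test setups),
// so infra fields fall back to "unknown" rather than being omitted.
function buildEventMetadata(
  version: VersionMetadata,
  cf: CfSubset | undefined,
  service: { name: string; environment: string },
): EventMetadata {
  return {
    deploy: {
      versionId: version.id,
      versionTag: version.tag,
      deployedAt: version.timestamp,
    },
    infra: {
      colo: cf?.colo ?? "unknown",
      country: cf?.country ?? "unknown",
    },
    service,
  };
}
```

In a handler this would be called with `env.CF_VERSION_METADATA` and `request.cf`, and the result spread into each emitted event.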
- Add slug field for machine-readable error classification
- Add expected boolean to distinguish anticipated vs unexpected errors
- Add deploy/infra/service metadata for consistency with other events
- Update "What to report" table with expected values and example slugs
- Add slug naming convention guidance

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add deploy/infra/service to all three event tables
- Add slug and expected to error event table
- Clarify geo exclusion (detailed geo excluded, colo+country sufficient)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…vents

Major restructuring per Boris Tane / Honeycomb guidance:

- Rename http_request_end → http_request, ws_session_end → ws_session
- Remove separate `error` event type (violates wide event philosophy)
- Add `outcome: "ok" | "error"` field to both events (Boris Tane pattern)
- Add `error` object with type, message, slug, expected, handler, stack
- Keep `client_error` as exception (no parent server-side unit of work)
- Update Design Decisions tables
- Update Event Volume estimates
- Add error example for http_request

Key insight: "ONE event per unit of work" - errors should be embedded
in the parent event for full context correlation, not separate events.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace per-action KV logging with lifecycle-based wide events emitted
to Cloudflare Workers Logs via console.log(JSON.stringify(...)).

Two event types implemented:
- http_request: One per HTTP request with embedded errors
- ws_session: One per WebSocket connection, emitted at disconnect

Key changes:
- Add observability.ts with event schemas, helpers, and emission
- Add route-patterns.ts for route pattern matching
- Update index.ts to emit http_request events instead of KV logs
- Update live-session.ts with PlayerObservability tracking and ws_session emission
- Clean up logging.ts to keep only state hashing utilities
- Configure wrangler.jsonc with observability, version_metadata, and env vars
- Update spec to defer client_error (violates wide event principle)

Events include deployment metadata (version ID, tag), infrastructure
info (colo, country), and service identity. Errors are embedded with
type, message, slug, expected flag, and optional stack trace.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
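
Route pattern matching exists so that events carry a low-cardinality route (useful for grouping) alongside the concrete path. The contents of route-patterns.ts are not shown in this PR, so the sketch below is only one plausible shape; apart from `/api/health`, the routes and the `"unknown"` fallback are hypothetical.

```typescript
// Ordered list of [matcher, pattern] pairs. Only /api/health appears in this
// PR; the session routes are illustrative placeholders.
const ROUTE_PATTERNS: Array<[RegExp, string]> = [
  [/^\/api\/health$/, "/api/health"],
  [/^\/api\/sessions\/[^/]+\/remix$/, "/api/sessions/:id/remix"],
  [/^\/api\/sessions\/[^/]+$/, "/api/sessions/:id"],
];

// Map a concrete pathname to its pattern so queries can group by route
// without one bucket per session ID.
function matchRoutePattern(pathname: string): string {
  for (const [matcher, pattern] of ROUTE_PATTERNS) {
    if (matcher.test(pathname)) return pattern;
  }
  return "unknown";
}
```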
Complete the three previously deferred items in Observability 2.0:

1. isCreator Detection (IP + User-Agent hash):
   - Add CreatorIdentity interface and helper functions
   - Store creator identity on first WebSocket connect
   - Persist to DO storage for hibernation survival
   - Compare on subsequent connections to detect creator
   - Include in ws_session event emission

2. syncRequestCount and syncErrorCount:
   - Track client-requested snapshot recovery
   - Track proactive sync when client falls behind (ACK gap)
   - Track sync errors when state is unavailable
   - Include counters in ws_session event

3. responseSize in HTTP events:
   - Add responseSize option to emitEvent helper
   - Calculate size using TextEncoder for key endpoints
   - Include in events for: session create, GET, remix, publish

All items verified by audit against spec.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
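
Items 2 and 3 above can be sketched as a per-connection counter set that is flushed into one `ws_session` event at disconnect, plus a byte-length helper for `responseSize`. The `SessionTracker` class, its method names, and the event fields other than `syncRequestCount` and `syncErrorCount` are assumptions; the PR's `PlayerObservability` type is not shown.

```typescript
// Hypothetical per-connection counters.
interface PlayerObservability {
  connectedAt: number;
  messageCount: number;
  syncRequestCount: number;
  syncErrorCount: number;
}

interface WsSessionEvent {
  event: "ws_session";
  durationMs: number;
  messageCount: number;
  syncRequestCount: number;
  syncErrorCount: number;
}

// Keyed by the connection object itself, mirroring a Map keyed by WebSocket.
class SessionTracker<Conn extends object> {
  private readonly players = new Map<Conn, PlayerObservability>();

  connect(conn: Conn, now: number): void {
    this.players.set(conn, {
      connectedAt: now,
      messageCount: 0,
      syncRequestCount: 0,
      syncErrorCount: 0,
    });
  }

  record(conn: Conn, kind: "message" | "syncRequest" | "syncError"): void {
    const player = this.players.get(conn);
    if (!player) return;
    if (kind === "message") player.messageCount++;
    else if (kind === "syncRequest") player.syncRequestCount++;
    else player.syncErrorCount++;
  }

  // Called once at disconnect: removes the connection and returns the single
  // wide event summarizing the whole session.
  disconnect(conn: Conn, now: number): WsSessionEvent | undefined {
    const player = this.players.get(conn);
    if (!player) return undefined;
    this.players.delete(conn);
    return {
      event: "ws_session",
      durationMs: now - player.connectedAt,
      messageCount: player.messageCount,
      syncRequestCount: player.syncRequestCount,
      syncErrorCount: player.syncErrorCount,
    };
  }
}

// Size in bytes of a serialized response body (UTF-8), via TextEncoder.
function responseSize(body: string): number {
  return new TextEncoder().encode(body).byteLength;
}
```

Measuring bytes rather than `body.length` matters for non-ASCII content, where one character can take several bytes.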
Add test:e2e:full-stack script that runs E2E tests against the complete
Cloudflare Worker stack (wrangler dev) instead of just the Vite dev server.
This enables testing of Durable Objects, KV storage, and Worker API endpoints.

Changes:
- Add scripts/test-e2e-full-stack.ts: builds project, starts wrangler dev,
  runs Playwright tests, and cleans up
- Add /api/health endpoint for test runner health checks
- Update playwright.config.ts to support PLAYWRIGHT_BASE_URL env var
- Fix hardcoded URL in pitch-contour-alignment.spec.ts to use API_BASE
- Add npm scripts: test:e2e:full-stack and test:e2e:full-stack:smoke

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
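
A test runner that starts wrangler dev needs to wait until the Worker responds before launching Playwright. The sketch below shows one way to poll a health check; it is not the code in scripts/test-e2e-full-stack.ts, and the `waitForHealth` name, the injected `probe` function, and the retry defaults are assumptions.

```typescript
// Poll until the probe reports HTTP 200 or the attempts run out.
// `probe` returns a status code; in a real runner it might be
//   () => fetch("http://localhost:8787/api/health").then((r) => r.status)
// where the port is an assumption based on wrangler dev's default.
async function waitForHealth(
  probe: () => Promise<number>,
  attempts = 30,
  delayMs = 1000,
): Promise<boolean> {
  for (let attempt = 0; attempt < attempts; attempt++) {
    try {
      if ((await probe()) === 200) return true;
    } catch {
      // Server is not accepting connections yet; retry after the delay.
    }
    await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
  }
  return false;
}
```

Injecting the probe keeps the polling logic testable without a running server.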
@adewale changed the title from "docs: Add logging guidelines spec based on loggingsucks.com analysis" to "feat: Implement Observability 2.0 wide events" on Jan 16, 2026
@adewale merged commit 38bd234 into main on Jan 16, 2026
6 checks passed
@adewale deleted the claude/logging-guidelines-spec-Vb5VM branch on January 16, 2026 at 17:09

2 participants